Assignment Instructions:

Assignment submission: Sofia Rodas


# load in necessary libraries
library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(interactions) 
library(ggbeeswarm)
library(ggridges)

Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals!


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is ceteris paribus or whether selection bias is likely (be specific!).

No, ceteris paribus is not achieved between the control sites (Arroyo Quemado, Carpenteria, and Mohawk) and the treatment sites (Naples, Isla Vista).The control sites do not provide a strong counter factual. The investigation is interested in analyzing the impact of MPAs on California Spiny Lobster (Panulirus Interruptus). In order to achieve ceteris paribus all other influencers would need to remain constant. However, this is impossible to do in a natural setting. The temperature of the ocean water, ocean currents, fishing pressure, and predator population are all examples of potential sources of dissimilarity between the control and treatment sites.


Step 2: Read & wrangle data

a. Read in the raw data from the “data” folder named spiny_abundance_sb_18.csv. Name the data.frame rawdata

# read in data
rawdata <- read_csv(here::here("data", "spiny_abundance_sb_18.csv"))

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

# make all names lower_snake
rawdata <-  janitor::clean_names(rawdata)

# change any -99999 to NA
rawdata$size_mm[rawdata$size_mm == -99999] <- NA

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
# factor the reef locations
tidydata <- rawdata |> 
    mutate(reef = factor(rawdata$site, levels = c("AQUE", 
                                                  "CARP", 
                                                  "MOHK", 
                                                  "IVEE", 
                                                  "NAPL"),
                                       labels = c( "Arroyo Quemado", 
                                                   "Carpenteria", 
                                                   "Mohawk", 
                                                   "Isla Vista", 
                                                   "Naples")))

Create new df named spiny_counts

# create spiny_counts df
spiny_counts <- tidydata

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

  • Create a variable mean_size from the variable size_mm
  • NOTE: The variable counts should have values which are integers (whole numbers).
  • Make sure to account for missing cases (na)!
#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

# new column addition: mean_size
spiny_counts <- spiny_counts |> 
    group_by(site, year, transect) |> 
    summarize(counts = sum(count, na.rm = TRUE), 
              mean_size = mean(size_mm, na.rm = TRUE)) |> 
    ungroup()

# change the data type of count to integer
spiny_counts$counts <- as.integer(spiny_counts$counts)

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(e): Use `case_when()` to create the 3 new variable columns

# add a mpa column
spiny_counts <- spiny_counts |> 
    mutate(mpa = case_when(
    site == "IVEE" | site == "NAPL" ~ "mpa",
    site == "AQUE" | site == "CARP" | site == "MOHK" ~ "non-mpa"
)) |> 
    mutate( treat = case_when(
        mpa == "mpa" ~ 1,
        mpa == "non-mpa" ~ 0
    ))

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
# lobster counts by reef site
spiny_counts |> 
    group_by(site) |> 
ggplot() +
    geom_density_ridges(aes(x = counts, y = site, fill = site),
                        quantile_lines=TRUE,
                        quantile_fun = median) +
    labs(x = "Count",
         y = "Site",
         title = "California Spiny Lobster Count by Reef Site") + 
    scale_fill_manual(values = c("red", "skyblue", "hotpink", "forestgreen", "orange"))  
Figure 1: California Spiny Lobster count by reef site depicting the median lobster count value.

Figure 1: California Spiny Lobster count by reef site depicting the median lobster count value.

Figure 1, the density ridge plot shows that the data from all of the sites are positively skewed data. In other words there are outliers in the data with lobster counts reaching into the hundreds. Mohawk has a number of observations between the counts of 100 and 200. Isla Vista counts taper off more slowly than the other sites with a greater conglomeration of counts between 50 and 100. Isla Vista counts also have some observations between 150 and 300. Carpenteria has a few peaks between 75 and 125, between 150 and 200, and between 240 and 275. Since the data is skewed right, the descriptive statistic of choice is the median. It reduces the influence of the outliers such as would be present if the statistic shown were the mean. Mohawk (non-MPA) and Isla Vista (MPA) have the highest medians.

  1. grouped by MPA status
# lobster counts by mpa status
spiny_counts |> 
    group_by(mpa) |> 
ggplot() +
    geom_violin(aes(x = mpa, y = counts, fill = mpa),
                draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "MPA",
         y = "Count",
         title = "California Spiny Lobster Count by MPA Status") +
    scale_fill_manual(values = c("#D77A61", "#5F7470"))
Figure 2: California Spiny Lobster count by MPA status depicting the quartile ranges for lobster count values.

Figure 2: California Spiny Lobster count by MPA status depicting the quartile ranges for lobster count values.

Figure 2, the violin plot above shows that non-MPA sites have more counts that are below approximately 25 lobsters. Many MPA sites have low counts as well however, the 75 percentile mark is located at a higher count for MPA sites compared to non-MPA sites. This signifies that proportionally, MPA sites have a higher chance of having more lobsters than non-MPA sites.

  1. grouped by year
# lobster counts by year
spiny_counts |> 
    group_by(year) |> 
    mutate(year = factor(year)) |> 
ggplot() +
    geom_boxplot(aes(x = counts, y = year, color = year),
                 outlier.shape = NA) +
    geom_beeswarm(aes(x = counts, y = year, color = year)) +
    labs(x = "Count",
         y = "Site",
         title = "California Spiny Lobster Count by Year") +      
    scale_color_manual(values = c("#640D14", 
                                  "skyblue", 
                                  "black", 
                                  "forestgreen", 
                                  "#D77A61", 
                                  "navy", 
                                  "hotpink"))
Figure 3: California Spiny Lobster count by year depicting the quartile ranges for lobster count values.

Figure 3: California Spiny Lobster count by year depicting the quartile ranges for lobster count values.

Firgure 3, the boxplot and beeswarm figure shows that lobster counts increase as more time passes since the creation of MPAs in 2012. This figure shows that lobster populations increase fairly steadily from 2012 to 2018. 2016 is an outlier year that has fewer lobsters than 2015 as shown by the boxplot. Though more analysis is necessary, the creation of MPAs appears to positively impact lobster populations.

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# create a new df to call to show mean on the plot
mean_size <- spiny_counts |> 
    group_by(mpa) |> 
    summarise(mean_mean_size = mean(mean_size, na.rm = TRUE))

# assign mean to variables 
mpa_mean <- mean_size$mean_mean_size[1]
non_mpa_mean <- mean_size$mean_mean_size[2]

# plot MPA and non-MPA zones mean-size in a histogram
spiny_counts |> 
    group_by(mpa) |> 
    ggplot(aes(x = mean_size, 
                fill = mpa)) +
    geom_histogram(alpha = 0.5) +
    scale_fill_manual(values = c("navy", "hotpink")) +
    geom_vline(xintercept = c(mpa_mean, non_mpa_mean),
               color = c("navy", "hotpink")) +
    labs(y = "Count",
         x = "Mean Size",
         title = "Mean Lobster Size in MPA and Non-MPA Zones")
Figure 4: California Spiny Lobster counts by mean size destinguishing between MPA statuses. The vertical lines are depicting the mean size for each treatment, MPA and non-MPA.

Figure 4: California Spiny Lobster counts by mean size destinguishing between MPA statuses. The vertical lines are depicting the mean size for each treatment, MPA and non-MPA.

The histogram above shows that MPA sites have a larger mean lobster size than non-MPA sites.

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()
gtsummary::tbl_summary(spiny_counts, by = treat)
Characteristic 0
N = 133
1
1
N = 119
1
site

    AQUE 49 (37%) 0 (0%)
    CARP 63 (47%) 0 (0%)
    IVEE 0 (0%) 56 (47%)
    MOHK 21 (16%) 0 (0%)
    NAPL 0 (0%) 63 (53%)
year

    2012 19 (14%) 17 (14%)
    2013 19 (14%) 17 (14%)
    2014 19 (14%) 17 (14%)
    2015 19 (14%) 17 (14%)
    2016 19 (14%) 17 (14%)
    2017 19 (14%) 17 (14%)
    2018 19 (14%) 17 (14%)
transect

    1 21 (16%) 14 (12%)
    2 21 (16%) 14 (12%)
    3 21 (16%) 14 (12%)
    4 14 (11%) 14 (12%)
    5 14 (11%) 14 (12%)
    6 14 (11%) 14 (12%)
    7 14 (11%) 14 (12%)
    8 7 (5.3%) 14 (12%)
    9 7 (5.3%) 7 (5.9%)
counts 9 (4, 22) 13 (3, 34)
mean_size 73 (68, 77) 77 (71, 80)
    Unknown 15 12
mpa

    mpa 0 (0%) 119 (100%)
    non-mpa 133 (100%) 0 (0%)
1 n (%); Median (Q1, Q3)

The mean size of lobsters in non-MPA sites is 73 mm while the mean size of lobsters in MPA sites is 77 mm.

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(counts ~ treat, data = spiny_counts)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

According to the OLS linear regression model, on average there are 22.73 lobsters in non-MPA sites. MPA sites have on average 5.36 more lobsters than non-MPA sites. Though the OLS linear regression model is convenient because the estimates for the intercept and coefficients are in counts of lobsters, it is impossible to have fractions of lobsters and may not be the best statistical model to analyze the difference in lobster counts between MPA and non-MPA sites.

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

check_model(m1_ols,  check = "qq" )

check_model(m1_ols, check = "normality")

The two plots above show that the residuals of the OLS linear regression model run on the lobster data does not follow what is expected for an OLS model that fits the data well. The observed data and the modelled data follow different trends. In other words a different model should be used for this data because there is a high dispersion of the residuals.

check_model(m1_ols, check = "homogeneity")

The reference line in the homogeneity of variance is not shown on the plot above. The homogeneity of variance should be flat but the observed data fit on lobster counts is curved. The variance is not homogenous as the variance values follow the curved line shown above. This again shows that the OLS linear regression model is not a good fit for the lobster count data. A model that does not make on the assumption of homogeneity of variance should be utilized to analyze lobster count between MPA and non-MPA sites.

check_model(m1_ols, check = "pp_check")

The blue line is the simulation of the model while the green line is the observed data. The data is simulated numerous times showing hence the many blue lines on the graph above. The observed data is the outlier of all of the density curves. If the OLM linear model were a good fit for the the data the observed data line and the model-predicted data lines would follow a similar trend. The model check again shows that the OLM linear model is not the appropriate model for this data.


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

d. Compare results with previous model, explain change in the significance of the treatment effect

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

m2_pois <- glm(counts ~ treat, 
               family = poisson(link = "log"),
               data = spiny_counts)

summ(m2_pois)
Observations 252
Dependent variable counts
Type Generalized linear model
Family poisson
Link log
𝛘²(1) 71.36
p 0.00
Pseudo-R² (Cragg-Uhler) 0.25
Pseudo-R² (McFadden) 0.01
AIC 11365.62
BIC 11372.68
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE

Because the Poisson regression model calculates statistical values in the link space, the values must be taken out of the link space for applicable world interpretations.

# reverse the link space 
lobster_count <- round(exp(m2_pois$coef[1]), 2)
mpa_treat <- exp(m2_pois$coef[2])

# calculate percent change of treatment as an MPA site
perc_lob_inc <- round((mpa_treat - 1) *100, 2)

There are on average approximately 22.73 in non-MPA sites according to the Poisson regression model. On average, there is an approximate 23.6 increase in the percentage of lobsters in MPA sites.

Poisson models assume that variance is proportional to the mean. When the variance is larger than the mean the model is over dispersed.

e. Check the model assumptions. Explain results.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

check_model(m2_pois)

The posterior predictive check, the mis-specified dispersion and zero-inflation check, and the distribution of quantile residuals do show that the observed data and the model-predicted data do not follow the same trends. The model does not accurately fit the data for these assumptions. The homogeneity of variance and influential observations are modeled well by the Poisson regression. The Poisson regression model assumes dispersion of the variance dependent on the mean.

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001

The dispersion ratio is 67.033 which is an extremely large value. The dispersion ratio should be equal to 1. The observed data is over dispersed when utilizing this model. The over dispersion of the data indicates that a Poisson regression model is not a good fit for the data.

check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

The observed zeros in this data are 27. This indicates that there are numerous zeros in the data. The Poisson regression model does not handle zeros well signifying an alternative model should be utilized to accurately analyze the data.

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model.

The negative binomial GLM model is a better fit for the data than the Poisson regression model in this case because of the large amount of zeros in the data. Negative binomial models account for over dispersion by adding an additional parameter.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
# NOTE: The `glm.nb()` function does not require a `family` argument

m3_nb <- glm.nb(counts ~ treat,
                data = spiny_counts)

summ(m3_nb)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
𝛘²(250) 1.52
p 0.22
Pseudo-R² (Cragg-Uhler) 0.01
Pseudo-R² (McFadden) 0.00
AIC 2088.53
BIC 2099.12
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE

Because the Negative Binomial calculates statistical values in the link space, the values must be taken out of the link space for applicable world interpretations.

# reverse the link space 
nb_lobster_count <- round(exp(m3_nb$coef[1]), 2)
nb_mpa_treat <- exp(m3_nb$coef[2])

# calculate percent change of treatment as an MPA site
nb_perc_lob_inc <- round((nb_mpa_treat -1) *100, 2)

There are on average approximately 22.73 in non-MPA sites according to the negative binomial model. On average, there is an approximate 23.6 increase in the percentage of lobsters in MPA sites. These are the same values as those acquired by analyzing the data with Poisson regression model.

check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088

The over dispersion test is close to 1. It is less than 5 and therefore within the range of accurately accounting for dispersion. The negative binomial model accurately accounts for the data’s dispersion in its statistical analysis.

check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12

The observed number of zeros is close to the predicted number of zeros meaning that the model accurately accounts for the zeros in its statistical analysis. The ratio is also close to 1 indicating that the observed and predicted zero-inflations are close. The negative binomial model accurately accounts for the data’s zero-inflation in its statistical analysis.

check_predictions(m3_nb)

The observed and predicted data follow the same trends in the posterior predictive check. The negative binomial model accurately fits the data and shows the statistical trends.

check_model(m3_nb)

The influential observations and the distribution of quantile residuals show the same trends between the observed and model-predicted data. The misspecified dispersion and zero-inflation assumption is not perfectly aligned between the predicted and observed data. The predicted data is within the confidence interval of the observed misspecified dispersion and zero-inflation data but does not perfectly follow the same values. This is an improvement from the Poisson model assumption checks. The negative binomial model assumes that there is dispersion in the variance and is therefore a good statistical model fit for the data.


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

export_summs(m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.
# transform the percent lobster values to count
mpa_lobster_count <- 22.73 * 0.23

The values from the OLS model is in lobster counts while the values in the Poisson and Negative Binomial models are in the link space. These values cannot be easily compared across the three types of models. Transformations are required to make all of these values into lobster counts.

As described above, the values for the Poisson regression model and negative binomial model are the same. The average number of lobsters in non-MPA sites is equal across the three models. However, the Poisson regression model and the negative binomial model have slightly lower with an on average 5.23 lobster increase for MPA sites. All of these models provide relatively similar lobster counts once translated into the response space.


Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following negative binomial model using glm.nb()

  • Add fixed effects for year (i.e., dummy coefficients)
  • Include an interaction term between variables treat & year (treat*year)

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

d. Explain why the main effect for treatment is negative? *Does this result make sense?

ff_counts <- spiny_counts |> 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- glm.nb(
    counts ~ 
        treat +
        year +
        treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

Because the Negative Binomial calculates statistical values in the link space, the values must be taken out of the link space for applicable world interpretations.

# reverse the link space 
mix_lobster_count <- round(exp(m5_fixedeffs$coef[1]), 2)
mix_mpa_treat <- round(exp(m5_fixedeffs$coef[1] + m5_fixedeffs$coef[2]), 2)

The intercept shows the average number of lobsters within non-MPA sites in 2012, when it is taken out of the link space. On average there are approximately 10.47 in 2012 for non-MPA sites. The treat variable shows the average number of lobsters for MPA sites in 2012 when taken out of the link space. On average there are approximately 1.88 in 2012 for MPA sites. The individual years without the interactive treat variable show the lobster counts for any given year for non-MPA site, when transformed into the response space. The treat and year interaction variables show the lobster counts for any given year for MPA sites, when transformed into the response space.

The main effect of treat is negative because the values date back to the creation of the MPAs. MPAs are protection areas that are typically created when ecosystems are in decline. The MPAs were likely created where the lobster populations were the lowest hence the low values are distorting the true effects of MPA site creation. This negative result makes sense.

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (b) above.

interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "response") # NOTE: y-axis on response-scale

# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts

g. Using ggplot() create a plot in the same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- ff_counts |> 
    group_by(year, mpa) |> 
    summarize(mean_count = mean(counts)) |> 
    mutate(year = as.factor(year)) |> 
    ggplot(aes(x = year, y = mean_count, color = mpa, group = mpa)) + 
        geom_line() +     
        geom_point() +
    scale_color_manual(values = c("navy", "hotpink"))+
    labs(x = "Year",
         y = "Mean Counts",
         title = "Mean Lobster Counts per Year: MPA versus Non-MPA Sites")

plot_counts

The mean lobster counts per year graph shows how the treatment effect changes across time. This graph is the same graph that outputs from graphing the negative binomial model outcomes in the response space when treatment, year, and a treatment and year interaction term are included in the model.


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

Yes, the spillover effect is likely to be occurring in the counts of lobsters between MPA and non-MPA sites. Lobsters can move between sites, as they are migratory creatures. Though lobsters having a homing inclination, there is a possibility of increased number of lobsters in MPA sites positively impacting the number of lobsters in non-MPA sites.

  1. Explain why spillover is an issue for the identification of causal effects

Spillover reduces the differences between the control and treatment groups. In other words, the effect of the treatment may be influencing the non-treatment counts.

  1. How does spillover relate to impact in this research setting?

Spillover may show that MPA sites are less effective than they are in reality because the effect of MPA sites may increase lobster counts in non-MPA sites. This can negatively impact the argument for more MPAs on the basis that it is hard to definitively state that the creation of MPAs is driving up lobster counts.

  1. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption.

SUTVA cannot be assumed in the context of MPA treatment sites. As discussed above, there is spillover in the lobster counts between the MPA and non-MPA sites.

2)  Exogeneity assumption. 

Exogeneity cannot be assumed in the context of MPA treatment sites. Though the increase in lobster counts in MPA sites is likely a result of the MPA site creations, the locations that were selected for MPA sites were not randomly assigned. Those areas selected to be MPA sites may have other characteristics that are causing the increased lobster count outcome.


EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (extracredit_sblobstrs24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)